{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# _Regression with kernel smoothing_\n",
    "## Or\n",
    "## \"Mr. Smoothie's Smoothie machine Smoother\", a Story in Seven Parts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# What are we doing here?\n",
    "\n",
    "We're doing regression!\n",
    "\n",
    "* start with kNN\n",
    "* generalize how the points in the training data are accounted for\n",
    "* generalize the way that the regression function is calculated \n",
    "* discuss the problems\n",
    "\n",
    "We will use simple data structures and lots of visualization to give a clear picture of what is happening in each step. $\\texttt{scikit-learn}$ will be used only when the implementation gets complicated.\n",
    "\n",
    "This is basically a reproductions of Essentials of Statistical Learning (ESL), Chapter 6, first few sections.\n",
    "\n",
    "\n",
    "# Background\n",
    "\n",
    "Presume a relationship between input data `X` and and target data `Y`:\n",
    "\n",
    "$Y = f(X) + \\textrm{noise}$\n",
    "\n",
    "The goal in regression is to calculate a function $\\hat{f}(X)$ that is a good estimate of $f(X)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import heapq\n",
    "\n",
    "import sklearn\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from scipy.stats import norm\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Start with some simple examples of data distributions.\n",
    "\n",
    "Remember: all parameter choices _matter_! We'll study the effects of our choices later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "n_pts = 70 # number of data points\n",
    "X = np.linspace(-1,1,n_pts) # build some X data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# size of noise\n",
    "sigma = 1  \n",
    "# get the corresponding Y data for a perfectly linear relationship with Gaussian noise\n",
    "Y = X + (np.random.randn(n_pts)*sigma) \n",
    "_ = plt.scatter(X,Y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's consider a more complicated relationship between the variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def f(X):\n",
    "    # cubic function of X\n",
    "    return -10*X*X*X - 2*X*X"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Switch to the familiar notation of training and test samples. \n",
    "\n",
    "Note: we will generate a single array of x-values, and draw test and training sets of y-values. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_train = X\n",
    "Y_train = f(X) + (np.random.randn(n_pts)*sigma)\n",
    "_ = plt.scatter(X_train,Y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see how well oridinary linear regression does."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# http://jmduke.com/posts/basic-linear-regressions-in-python/\n",
    "\n",
    "def basic_linear_regression(x, y):\n",
    "    \"\"\"\n",
    "    Use least-square minimization to compute the regression coefficients\n",
    "    for a 1-dim linear model.\n",
    "    \n",
    "    parameters:\n",
    "        x: array of values for the independant (feature) variable \n",
    "        y: array of values for the dependaent (target) variable\n",
    "        \n",
    "    return value:\n",
    "        2-tuple of slope and y-intercept\n",
    "    \"\"\"\n",
    "    \n",
    "    # Basic computations to save a little time.\n",
    "    length = len(x)\n",
    "    sum_x = sum(x)\n",
    "    sum_y = sum(y)\n",
    "\n",
    "    # Σx^2, and Σxy respectively.\n",
    "    sum_x_squared = sum(map(lambda a: a * a, x))\n",
    "    sum_of_products = sum([x[i] * y[i] for i in range(length)])\n",
    "\n",
    "    # Magic formulae!  \n",
    "    a = (sum_of_products - (sum_x * sum_y) / length) / (sum_x_squared - ((sum_x ** 2) / length))\n",
    "    b = (sum_y - a * sum_x) / length\n",
    "    return a, b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "B_1,B_0 = basic_linear_regression(X_train,Y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make a plotting function that make a scatter plot of the data overlaid with the estimated regression function, $\\hat{f}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def plot(X,Y,Y_hat):\n",
    "    \"\"\"\n",
    "    Plot data and estimated regression function\n",
    "    \n",
    "    Parameters:\n",
    "        X: independant variable\n",
    "        Y: dependant variable\n",
    "        Y_hat: estimate of the dependant variable; f_hat(X)\n",
    "    \"\"\"\n",
    "    plt.scatter(X,Y,label='data')\n",
    "    plt.plot(X,Y_hat,label='estimate',color='g',linewidth=2)\n",
    "    plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "Y_hat_train = X_train*B_1 + B_0 \n",
    "\n",
    "plot(X_train,Y_train,Y_hat_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How can we quantify the quality of the regression?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def mse(Y_hat_train,Y_train,Y_test,print_results=True):\n",
    "    \"\"\" \n",
    "    Print mean squared error for test and train data\n",
    "    \n",
    "    Parameters:\n",
    "        Y_hat_train: estimated y-values for the training set\n",
    "        Y_train: true y-values for the training set\n",
    "        Y_test: true y-values for an independant test set, based on the _same_ x-values as Y_train. \n",
    "    \n",
    "    Return value:\n",
    "        tuple(training error, test error)\n",
    "    \n",
    "    \"\"\"\n",
    "    train_err = np.mean([abs(yh-y)**2 for y,yh in zip(Y_train,Y_hat_train)])\n",
    "    test_err = np.mean([abs(yh-y)**2 for y,yh in zip(Y_test,Y_hat_train)])\n",
    "    \n",
    "    if print_results:\n",
    "        print(\"train err: {0:.3f}\".format(train_err))\n",
    "        print(\"test err: {0:.3f}\".format(test_err))\n",
    "    else:\n",
    "        return train_err,test_err\n",
    "              "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# draw a _test_ sample from f(X)\n",
    "Y_test =  f(X) + (np.random.randn(n_pts)*sigma)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "mse(Y_hat_train,Y_train,Y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# k Nearest Neighbors\n",
    "\n",
    "Remember how kNN works: \n",
    "\n",
    "The value of the function $\\hat{f}$ is calculated at every point $x_0$ in X and is given by the __average__ of the $y$ values for the $k$ nearest neighbors in the training data.\n",
    "\n",
    "$\\hat{f}(x)=Average(y_i |~ x_i\\in N_k(x))$,\n",
    "\n",
    "where $N_k(x)$ is the set of $k$ nearest neighbor points to $x$.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def kNN(X,Y,x_0,k=20):\n",
    "    \"\"\"\n",
    "    Simple 1-D implementation of kNN average.\n",
    "    \n",
    "    Parameters:\n",
    "        X: the vector of feature data\n",
    "        x_0: a particular point in the feature space\n",
    "        k: number of nearest neighbors to include\n",
    "    \n",
    "    Return value:\n",
    "        The estimated regression function.\n",
    "    \n",
    "    For our purposes, think of a heapq object as a sorted list with many nice performance properties. \n",
    "    The first item is always the smallest. For items that are tuples, the default is to sort\n",
    "    by the first element in the tuple.\n",
    "    \"\"\"\n",
    "    nearest_neighbors = []\n",
    "    for x,y in zip(X,Y):\n",
    "        distance = abs(x-x_0)\n",
    "        heapq.heappush(nearest_neighbors,(distance,y))    \n",
    "    return np.mean( [heapq.heappop(nearest_neighbors)[1] for _ in xrange(k)] )\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "k = 15\n",
    "Y_hat_train_knn = [kNN(X_train,Y_train,x_0,k=k) for x_0 in X_train]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plot(X_train,Y_train,Y_hat_train_knn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "mse(Y_hat_train_knn,Y_train,Y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As $k\\rightarrow 1$, the model exactly matches the training data, and the training error goes to zero. But the test error increases as the variance goes up.\n",
    "\n",
    "# Kernel smoothing\n",
    "\n",
    "The function $N_k(X)$ is a kernel function. It defines how the data points contribute to the calculation of the regression function, as a function of $X$. We can think of $N_k$ as assigning weights of $0$ or $1$ to every point in the training data, as a function of $X$. \n",
    "\n",
    "We can generalize the kNN function above to calculate the weighted average of $Y_{train}$ at $X_0$, for an arbitrary kernel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def kernel_smoother(X,Y,x_0,kernel,width):\n",
    "    \"\"\"\n",
    "    Generalization of 1-D kNN average, with custom kernel.\n",
    "    \n",
    "    Parameters:\n",
    "        X: the vector of feature data\n",
    "        x_0: a particular point in the feature space\n",
    "        kernel: kernel function\n",
    "        width: kernel width\n",
    "    \n",
    "    Return value:\n",
    "        The estimated regression function at x_0.\n",
    "    \"\"\"\n",
    "    kernel_weights = [kernel(x_0,x,width) for x in X]\n",
    "    \n",
    "    weighted_average = np.average(Y,weights=kernel_weights)\n",
    "    return weighted_average    \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def epanechnikov_kernel(x_0,x,width):\n",
    "    \"\"\"\n",
    "    For a point x_0 in x, return the weight for the given width.\n",
    "    \"\"\"\n",
    "    def D(t):\n",
    "        if t <= 1:\n",
    "            #return 3/4*float(1-t*t) <== why doesn't this work?\n",
    "            return float(1-t*t)*3/4\n",
    "        else:\n",
    "            return 0\n",
    "    return D(abs(x-x_0)/width)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# plot the Epanechnikov kernel at x_0 = 0, width = 1 to get a sense for it\n",
    "Y = [epanechnikov_kernel(0,x,1) for x in X]\n",
    "_ = plt.plot(X,Y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the updateded kNN with an Epanechnikov kernel, make a better prediction function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "width=0.35\n",
    "Y_hat_train_epan_kernel = [kernel_smoother(X_train\n",
    "                                           ,Y_train\n",
    "                                           ,x_0\n",
    "                                           ,kernel=epanechnikov_kernel\n",
    "                                           ,width=width) \n",
    "                           for x_0 in X_train]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plot(X_train,Y_train,Y_hat_train_epan_kernel)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "mse(Y_hat_train_epan_kernel,Y_train,Y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are other kernels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def tri_cube_kernel(x_0,x,width):\n",
    "    def D(t):\n",
    "        if t <= 1:\n",
    "            return float(1-t*t*t)**3\n",
    "        else:\n",
    "            return 0\n",
    "    return D(abs(x-x_0)/width)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# plot some kernels at x_0 = 0, width = 1 to get a sense for them\n",
    "Y1 = [epanechnikov_kernel(0,x,1) for x in X]\n",
    "Y2 = [tri_cube_kernel(0,x,1) for x in X]\n",
    "Y3 = [norm.pdf(x) for x in X]\n",
    "plt.plot(X,Y1,label=\"Epanechnikov\")\n",
    "plt.plot(X,Y2,color='g',label=\"tri-cube\")\n",
    "plt.plot(X,Y3,color='k',label=\"Gaussian\")\n",
    "plt.legend(loc='best')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "Y_hat_train_tri_kernel = [kernel_smoother(X_train\n",
    "                                          ,Y_train\n",
    "                                          ,x_0,kernel=tri_cube_kernel\n",
    "                                          ,width=width) \n",
    "                          for x_0 in X_train]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plot(X_train,Y_train,Y_hat_train_tri_kernel)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "mse(Y_hat_train_tri_kernel,Y_train,Y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Local Linear Regression\n",
    "\n",
    "\n",
    "Manage the bias at the boundary by replacing the weighted average with a weighted linear fit.\n",
    "\n",
    "For each point $x_0$ in X: \n",
    "\n",
    " 1. use a kernel to get a set of weights for all points in the the training data\n",
    " 2. do a weighted, linear regression on those points (and weights) to determine the least-square parameters: the slope ($\\beta_1$) and y-intercept ($\\beta_0$).\n",
    " 3. calculate the estimated regression function at $x_0$: $\\hat{y}(x_0) = \\beta_0 + x_0 * \\beta_1$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "def linear_kernel_model(X,Y,x_0,kernel,width):\n",
    "    \"\"\"\n",
    "    1-D kernel-smoothed model with local linear regression.\n",
    "    \n",
    "    Parameters:\n",
    "        X: the vector of feature data\n",
    "        x_0: a particular point in the feature space\n",
    "        kernel: kernel function\n",
    "        width: kernel width\n",
    "    \n",
    "    Return value:\n",
    "        The estimated regression function at x_0.\n",
    "    \"\"\"\n",
    "    kernel_weights = [kernel(x_0,x,width) for x in X]\n",
    "    \n",
    "    # the scikit-learn functions want something more numpy-like: an array of arrays\n",
    "    X = [[x] for x in X]\n",
    "    \n",
    "    wls_model = LinearRegression()\n",
    "    wls_model.fit(X,Y,kernel_weights)\n",
    "\n",
    "    B_0 = wls_model.intercept_\n",
    "    B_1 = wls_model.coef_[0]\n",
    "    \n",
    "    y_hat = B_0 + B_1*x_0\n",
    "    \n",
    "    return y_hat"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "Y_hat_train_linear_reg_epan_kernel = [\n",
    "    linear_kernel_model(X_train\n",
    "                        ,Y_train\n",
    "                        ,x_0\n",
    "                        ,kernel=epanechnikov_kernel\n",
    "                        ,width=width) \n",
    "    for x_0 in X_train]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plot(X_train,Y_train,Y_hat_train_linear_reg_epan_kernel)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "mse(Y_hat_train_linear_reg_epan_kernel,Y_train,Y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How can we optimize the value of meta-parameters like the kernel width?\n",
    "\n",
    "Remember, the performance of many such parameters is correlated to variables like $\\texttt{n}\\_\\texttt{pts}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def test_width(X_train,Y_train,Y_test,values,model=linear_kernel_model,kernel=epanechnikov_kernel):\n",
    "    \"\"\"\n",
    "    Make a plot of the test and training mse as a function of some parameter\n",
    "    \"\"\"\n",
    "    train_errors = []\n",
    "    test_errors = []\n",
    "    for width in values:\n",
    "        Y_hat_train = [\n",
    "            model(X_train\n",
    "                ,Y_train\n",
    "                ,x_0,kernel=kernel\n",
    "                ,width=width\n",
    "                ) \n",
    "        for x_0 in X_train]\n",
    "        train_err,test_err = mse(Y_hat_train,Y_train,Y_test,print_results=False)\n",
    "        train_errors.append(train_err)\n",
    "        test_errors.append(test_err)\n",
    "    plt.plot(values,train_errors,'g.',ms=10,label=\"train error\")\n",
    "    plt.plot(values,test_errors,'b.',ms=10,label=\"test error\")\n",
    "    plt.legend(loc='best')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "widths = np.linspace(0.001,1,50) # 50 evenly-spaced width values\n",
    "test_width(X_train,Y_train,Y_test,widths,model=linear_kernel_model,kernel=epanechnikov_kernel)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "width=0.2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Higher-dimensional data\n",
    "\n",
    "Make some functions of multi-dimensional input data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def f_2D(X):\n",
    "    return [(2*x[1]**3) + x[1]*x[0]*4 for x in X]\n",
    "def f_3D(X):\n",
    "    return [(-2*x[0]*x[0]) + (2*x[1]**3) + x[2]*x[1]*x[0]*4 for x in X]\n",
    "def f_nD_poly(X,n=2):\n",
    "    \"\"\"\n",
    "    Build a function that goes like x^n on each feature dimension x in X\n",
    "    \"\"\"\n",
    "    return [ sum([x**n for x in dim]) for dim in X]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate random data in 2 dimensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import random\n",
    "n_pts = 50\n",
    "X_train = np.array([[random.random(),random.random()] for _ in range(n_pts)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sigma = 0.1\n",
    "Y_train = f_nD_poly(X_train,2) + ( np.random.randn(n_pts) * sigma )\n",
    "Y_test = f_nD_poly(X_train,2) + ( np.random.randn(n_pts) * sigma )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test first with a multi-dimensional kNN."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def kNN_multiD(X,Y,x_0,k=20,kernel_pars=None):\n",
    "    \"\"\"\n",
    "    Simple multi-dimensional implementation of kNN average.\n",
    "    \n",
    "    Parameters:\n",
    "        X: the vector of feature data\n",
    "        x_0: a particular point in the feature space\n",
    "        k: number of nearest neighbors to include\n",
    "    \n",
    "    Return value:\n",
    "        The estimated regression function at x_0.\n",
    "\n",
    "    Note: use numpy.linalg.norm for N-dim norms.\n",
    "    \"\"\"\n",
    "    nearest_neighbors = []\n",
    "    for x,y in zip(X,Y):\n",
    "        distance = np.linalg.norm(np.array(x)-np.array(x_0))\n",
    "        heapq.heappush(nearest_neighbors,(distance,y))    \n",
    "    return np.mean( [heapq.heappop(nearest_neighbors)[1] for _ in xrange(k)] )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember that, because the dimensionality of the features is no long one, the scale of the error will be different. So don't compare the numbers below with those from the 1-D examples above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "Y_hat_train = [\n",
    "    kNN_multiD(X_train,Y_train,x_0,k=k) \n",
    "    for x_0 in X_train]\n",
    "mse(Y_hat_train,Y_train,Y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multi-dimensional versions of kernel and model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def epanechnikov_kernel_multiD(x_0,x,width=1):\n",
    "    def D(t):\n",
    "        #print(\"width = {}\".format(width))\n",
    "        if t <= 1:\n",
    "            return float(1-t*t)*3/4\n",
    "        else:\n",
    "            return 0\n",
    "    return D(np.linalg.norm(np.array(x)-np.array(x_0))/width)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's also generalize the model so that the regression need not be simple and linear."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def generalized_kernel_model(X,Y,x_0,kernel=epanechnikov_kernel_multiD,kernel_pars={},regressor=LinearRegression):\n",
    "    \"\"\"\n",
    "    Multi-D kernel-smoothed model with local generalized regression.\n",
    "    \n",
    "    Parameters:\n",
    "        X: the vector of feature data\n",
    "        x_0: a particular point in the feature space\n",
    "        kernel: kernel function\n",
    "        width: kernel width\n",
    "        regressor: regression class - must follow scikit-learn API\n",
    "    \n",
    "    Return value:\n",
    "        The estimated regression function at x_0.\n",
    "    \"\"\"\n",
    "    kernel_weights = [kernel(x_0,x,**kernel_pars) for x in X]\n",
    "    model = regressor()\n",
    "    model.fit(X,Y,sample_weight=kernel_weights)\n",
    "  \n",
    "    return model.predict([x_0])[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Ridge,Lasso,ElasticNet\n",
    "regressor=LinearRegression\n",
    "\n",
    "width = 0.5\n",
    "Y_hat_train = [\n",
    "    generalized_kernel_model(X_train\n",
    "                             ,Y_train\n",
    "                             ,x_0\n",
    "                             ,kernel=epanechnikov_kernel_multiD\n",
    "                             ,kernel_pars={\"width\":width}\n",
    "                             ,regressor=regressor) \n",
    "    for x_0 in X_train]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compare the MSE here to that of the kNN above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "mse(Y_hat_train,Y_train,Y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Experiments\n",
    "\n",
    "Build an API that allows you to optimize any parameter by visualizing the test and training errors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def test_parameter(parameter_name,values,args):\n",
    "    \"\"\"\n",
    "    Make a plot of the test and training mse as a function of some parameter\n",
    "    \n",
    "    Parameters:\n",
    "        parameter name:\n",
    "        values: \n",
    "        args:\n",
    "    \n",
    "    \"\"\"\n",
    "    train_errors = []\n",
    "    test_errors = []\n",
    "    \n",
    "    # get the dictionary element and shortened name for the variable parameter\n",
    "    par_name = \"\"\n",
    "    \n",
    "    X_train = np.array([[random.random() for _ in range(args['main']['n_dim'])] for _ in range(args['main']['n_pts'])])\n",
    "    Y_train = args['main']['func'](X_train) + ( np.random.randn(args['main']['n_pts']) * args['main']['sigma'] )\n",
    "    Y_test = args['main']['func'](X_train) + ( np.random.randn(args['main']['n_pts']) * args['main']['sigma'] )\n",
    "    \n",
    "    for value in values:\n",
    "        # set the value of the variable parameter for this iteration\n",
    "        location = args\n",
    "        for key_num,key in enumerate(parameter_name.split(':')):\n",
    "            par_name = key\n",
    "            if key_num+1 < len(parameter_name.split(':')): \n",
    "                location = location[key]\n",
    "            else:\n",
    "                location[key] = value  \n",
    "        \n",
    "        Y_hat_train = [\n",
    "            args['main']['model_name'](X_train\n",
    "                ,Y_train\n",
    "                ,x_0\n",
    "                ,kernel=args['main']['kernel']\n",
    "                ,**args['model']\n",
    "                ) \n",
    "        for x_0 in X_train]\n",
    "                \n",
    "        train_err,test_err = mse(Y_hat_train,Y_train,Y_test,print_results=False)\n",
    "        train_errors.append(train_err)\n",
    "        test_errors.append(test_err)\n",
    "        \n",
    "    plt.plot(values,train_errors,'g.',ms=15,label=\"train error\")\n",
    "    plt.plot(values,test_errors,'b.',ms=15,label=\"test error\")\n",
    "    plt.title(par_name)\n",
    "    plt.legend(loc='best')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "arguments = {\n",
    "    \"main\":{\n",
    "        \"func\":f_nD_poly,\n",
    "        \"model_name\":generalized_kernel_model,\n",
    "        \"kernel\":epanechnikov_kernel_multiD,\n",
    "        \"n_pts\":30,\n",
    "        \"sigma\":0.1,\n",
    "        \"n_dim\":2\n",
    "    },\n",
    "    \"model\":{\n",
    "        \"regressor\":LinearRegression,\n",
    "        \"kernel_pars\":{\n",
    "            \"width\":0.2\n",
    "        }\n",
    "    }\n",
    "}\n",
    "test_parameter(\"model:kernel_pars:width\",np.linspace(0.01,1.5,30),arguments)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
